Draft: Real vectors for tests #980

Open · wants to merge 7 commits into dev

Conversation

@I8dNLo (Contributor) commented Apr 29, 2025

Proof of concept for using real vectors in tests.

netlify bot commented Apr 29, 2025

Deploy Preview for poetic-froyo-8baba7 ready!

  • 🔨 Latest commit: 71ad426
  • 🔍 Latest deploy log: https://app.netlify.com/projects/poetic-froyo-8baba7/deploys/682f9d6e8882f90008ed6a43
  • 😎 Deploy Preview: https://deploy-preview-980--poetic-froyo-8baba7.netlify.app

@I8dNLo I8dNLo changed the base branch from master to dev April 29, 2025 10:27

coderabbitai bot commented Apr 29, 2025

📝 Walkthrough

The changes unify vector size constants across test modules by replacing hardcoded values with a single imported variable text_vector_size. Random vector generation in tests is replaced with calls to sample_queries from tests.fixtures.points, ensuring consistent and reproducible test vectors. The tests/fixtures/points.py module was extended to load and preprocess external vector data, provide a sample_queries function, and update vector generation functions to optionally return preloaded vectors. Tests involving multivector queries now use the new sample_random_multivector function. These modifications affect test data initialization and vector size usage but do not alter public API signatures except for added functions in tests/fixtures/points.py.
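
For orientation, a minimal sketch of what such a fixture module could look like. The data/text.npy path and the 384-dimensional vectors appear elsewhere in this PR; the exact signatures of sample_queries and sample_random_multivector shown here are assumptions, not the actual code:

import numpy as np

# Preload real embeddings once at import time (path as used by this PR's
# fixtures; the array is assumed to have shape (num_vectors, 384)).
_text_vectors = np.load("data/text.npy")

def sample_queries(n: int) -> list[list[float]]:
    # Deterministic slice, so repeated test runs see identical query vectors.
    return _text_vectors[:n].tolist()

def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    # A multivector here is a bag of real vectors sampled without replacement.
    idx = np.random.choice(len(_text_vectors), size=vec_count, replace=False)
    return _text_vectors[idx, :vec_size].tolist()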

Suggested reviewers

  • coszio
    """

@coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (37)
docs/source/qdrant_client.http.api.snapshots_api.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the HTTP API index.
Ensure this new docs file is referenced in docs/source/qdrant_client.http.api.rst so that the snapshots API appears in the HTTP API toctree:


🏁 Script executed:

rg -n "snapshots_api.rst" docs/source/qdrant_client.http.api.rst

Length of output: 64


Add snapshots_api.rst to HTTP API toctree
The new docs/source/qdrant_client.http.api.snapshots_api.rst file isn’t referenced in the HTTP API index. Please update:

  • File: docs/source/qdrant_client.http.api.rst
    Under the existing .. toctree:: directive, add:
       snapshots_api.rst
docs/source/qdrant_client.http.api.cluster_api.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the HTTP API index.
Make sure this file is added to docs/source/qdrant_client.http.api.rst so the cluster API is included:


🏁 Script executed:

rg -n "cluster_api.rst" docs/source/qdrant_client.http.api.rst

Length of output: 62


Add cluster_api.rst to HTTP API docs index
The cluster_api.rst page isn’t referenced in docs/source/qdrant_client.http.api.rst, so the cluster endpoints won’t be included in the generated documentation. Please update the toctree in that file:

• File needing update:
docs/source/qdrant_client.http.api.rst

• Example patch:

 .. toctree::
    :maxdepth: 2

    collection_api.rst
+   cluster_api.rst
    # …other modules
📝 Committable suggestion

# docs/source/qdrant_client.http.api.rst

.. toctree::
   :maxdepth: 2

   collection_api.rst
   cluster_api.rst
   # …other modules
docs/source/qdrant_client.fastembed_common.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the main documentation.
Ensure this file appears in your toctree (e.g., in docs/source/qdrant_client.rst or docs/source/index.rst):


🏁 Script executed:

rg -n "fastembed_common.rst" docs/source/qdrant_client.rst docs/source/index.rst

Length of output: 80


Add fastembed_common.rst to the main toctree
The file docs/source/qdrant_client.fastembed_common.rst is not referenced in any toctree in docs/source/qdrant_client.rst or docs/source/index.rst.

• In docs/source/qdrant_client.rst (or docs/source/index.rst), insert:

.. toctree::
   :maxdepth: 2
   :caption: Qdrant Client Reference

   qdrant_client.fastembed_common
.github/workflows/integration-tests.yml (2)

60-68: 💡 Verification agent

🧩 Analysis chain

Ensure HF_TOKEN is correctly documented

The addition of the Hugging Face token suggests integration with embedding models, which aligns with the PR objective of using real vectors for tests. Ensure this requirement is documented for developers who will run these tests locally.


🏁 Script executed:

#!/bin/bash
# Check if HF_TOKEN usage is documented in the project

echo "Checking for HF_TOKEN documentation in README or docs..."
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type md .
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type rst .
rg -i 'HF_TOKEN|hugging\s*face\s*token' README.md || echo "No mention of HF_TOKEN in README.md"

echo "Checking for HF_TOKEN usage in test code..."
rg -i 'HF_TOKEN|hugging\s*face' --type py ./tests

Length of output: 513


Add HF_TOKEN setup documentation for local integration tests

The HF_TOKEN environment variable is now required to run the integration tests, but there’s no mention of this in the project docs. Please document how to obtain and set HF_TOKEN for local development and CI.

• In README.md (or under a new “Integration Tests” section), include:

  • A note that HF_TOKEN must be set before running ./tests/integration-tests.sh
  • Instructions or a link on how to generate an access token from Hugging Face (e.g., via https://huggingface.co/settings/tokens)
  • Examples for setting the variable in shell (export HF_TOKEN=…) or in a .env file

• If you maintain separate contributor or developer docs, add the same information there; a minimal local-run guard sketch follows below.
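
As a sketch of such a local-run guard (hypothetical helper, not part of this PR; the requires_hf_token marker name is made up for illustration):

import os

import pytest

# Skip embedding-backed tests when HF_TOKEN is absent instead of failing
# later with an authentication error from Hugging Face.
requires_hf_token = pytest.mark.skipif(
    os.getenv("HF_TOKEN") is None,
    reason="HF_TOKEN not set; see the integration-tests setup notes",
)

@requires_hf_token
def test_search_with_real_vectors() -> None:  # hypothetical test name
    ...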


40-51: 💡 Verification agent

🧩 Analysis chain

Verify if conditional tests should run on all Python versions

The async client and inspection cache tests are currently limited to Python 3.10.x. Consider whether these should also run on other Python versions to ensure consistent behavior across environments.


🏁 Script executed:

#!/bin/bash
# Check if there are any Python version-specific dependencies or features used in the async client
# or inspection cache that would justify limiting these tests to Python 3.10.x only

echo "Checking for Python 3.10-specific imports or features in async client code..."
rg -i 'python\s*3\.10' --type py ./qdrant_client
rg -i 'match\s+.*:' --type py ./qdrant_client  # Pattern matching (3.10 feature)
rg -i 'PEP.*?585' --type py ./qdrant_client

echo "Checking if test scripts have any Python version checks..."
cat ./tests/async-client-consistency-check.sh
cat ./tests/inspection-cache-consistency-check.sh

Length of output: 2715


I don’t see any Python-3.10-only constructs in the client code or test scripts, and the tooling scripts themselves invoke generic python (rather than pinning to 3.10). The consistency-check steps should work on all supported Python versions.

Update the workflow to run these tests on every Python version

 .github/workflows/integration-tests.yml
@@ lines 40-51
-      - name: Run async client generation tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/async-client-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run async client generation tests
+        run: ./tests/async-client-consistency-check.sh
+        shell: bash

-      - name: Run inspection cache population tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/inspection-cache-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run inspection cache population tests
+        run: ./tests/inspection-cache-consistency-check.sh
+        shell: bash

This ensures both checks execute across all Python versions in your CI matrix.

📝 Committable suggestion

      - name: Run async client generation tests
        run: ./tests/async-client-consistency-check.sh
        shell: bash

      - name: Run inspection cache population tests
        run: ./tests/inspection-cache-consistency-check.sh
        shell: bash
qdrant_client/embed/models.py (1)

1-12: ⚠️ Potential issue

Add from __future__ import annotations for Python 3.8 compatibility

list[...] generics are evaluated at runtime and therefore raise TypeError: 'type' object is not subscriptable on Python 3.8.
Unless the client has officially dropped 3.8 support, this will be a runtime blocker.

+from __future__ import annotations
 from typing import Union

Alternatively, fall back to typing.List / typing.Dict, but the future-import keeps the code cleaner.

Also applies to: 13-23
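
A minimal repro of the 3.8 failure mode and the future-import remedy (illustrative; it uses a plain class, whose string annotations are never evaluated):

from __future__ import annotations  # PEP 563: annotations are stored as strings

from typing import Union

class Payload:
    # Without the future import, Python 3.8 would evaluate list[str] eagerly at
    # class-definition time and raise:
    #   TypeError: 'type' object is not subscriptable
    tags: list[str]
    extra: Union[dict, None]

Note that Pydantic still evaluates annotations when it builds a model, so for BaseModel subclasses that must run on 3.8 the typing.List fallback mentioned above remains the safer choice.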

qdrant_client/embed/type_inspector.py (1)

42-50: ⚠️ Potential issue

Avoid premature return False when iterating collections

inspect() exits on the first non-BaseModel element, potentially missing valid models later in the iterable.

-                else:
-                    return False
+                else:
+                    # Skip non-pydantic items and continue inspecting
+                    continue

With this change, the loop continues scanning the remaining elements, ensuring no false negatives.

📝 Committable suggestion

        elif isinstance(points, Iterable):
            for point in points:
                if isinstance(point, BaseModel):
                    self.parser.parse_model(point.__class__)
                    if self._inspect_model(point):
                        return True
                else:
                    # Skip non-pydantic items and continue inspecting
                    continue
        return False
qdrant_client/common/version_check.py (1)

12-18: 🛠️ Refactor suggestion

Add explicit timeout & exception handling to avoid crashing the caller

httpx.get is executed outside a try/except block and relies on httpx's default 5-second timeout.
A transient network issue will raise and crash the caller instead of being handled gracefully.

-    response = httpx.get(rest_uri, headers=rest_headers, auth=auth_provider)
+    try:
+        response = httpx.get(
+            rest_uri,
+            headers=rest_headers,
+            auth=auth_provider,
+            timeout=5.0,  # seconds – tune if necessary
+        )
+    except httpx.HTTPError as exc:
+        logging.debug(f"Failed to fetch server version: {exc!r}")
+        return None
📝 Committable suggestion

def get_server_version(
    rest_uri: str, rest_headers: dict[str, Any], auth_provider: Optional[BearerAuth]
) -> Optional[str]:
    try:
        response = httpx.get(
            rest_uri,
            headers=rest_headers,
            auth=auth_provider,
            timeout=5.0,  # seconds – tune if necessary
        )
    except httpx.HTTPError as exc:
        logging.debug(f"Failed to fetch server version: {exc!r}")
        return None

    if response.status_code == 200:
        version_info = response.json().get("version", None)
qdrant_client/hybrid/fusion.py (1)

7-11: 🛠️ Refactor suggestion

Rank index should be 1-based in Reciprocal Rank Fusion

Classical RRF uses 1 / (k + rank) where rank starts at 1 (best item).
Using the 0-based loop index shifts every rank down by one, inflating each entry's score.

-    def compute_score(pos: int) -> float:
+    def compute_score(pos: int) -> float:
         ranking_constant = (
             2  # the constant mitigates the impact of high rankings by outlier systems
         )
-        return 1 / (ranking_constant + pos)
+        return 1 / (ranking_constant + pos + 1)
📝 Committable suggestion

    def compute_score(pos: int) -> float:
        ranking_constant = (
            2  # the constant mitigates the impact of high rankings by outlier systems
        )
        return 1 / (ranking_constant + pos + 1)
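
To make the off-by-one concrete, a minimal self-contained RRF sketch with 1-based ranks (illustrative only, not the client's implementation):

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 2) -> list[str]:
    # Classical RRF: each result list contributes 1 / (k + rank), rank from 1.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-based rank
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# With k = 2 the best-ranked item contributes 1/3 per list; the 0-based loop
# index would give 1/2, uniformly over-weighting every entry.
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "c"]]))
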
qdrant_client/_pydantic_compat.py (2)

65-67: 🛠️ Refactor suggestion

Return a homogeneous type from model_config

For Pydantic v2 the function now returns a ConfigDict, not a plain dict, which breaks the declared return annotation and downstream dict operations (e.g. model_config(...)["populate_by_name"]). Wrap with dict() for v2 to keep behaviour identical across versions.

-        return model.model_config
+        return dict(model.model_config)
📝 Committable suggestion

def model_config(model: Type[BaseModel]) -> dict[str, Any]:
    if PYDANTIC_V2:
        return dict(model.model_config)

19-20: ⚠️ Potential issue

Prevent possible KeyError in to_jsonable_python for Pydantic v1

ENCODERS_BY_TYPE[type(x)] will raise when type(x) has no registered encoder (e.g., an arbitrary user-defined class). A safe fallback keeps the helper usable for arbitrary objects (note that the fallback needs json imported at the top of the module):

-    def to_jsonable_python(x: Any) -> Any:
-        return ENCODERS_BY_TYPE[type(x)](x)
+    def to_jsonable_python(x: Any) -> Any:
+        encoder = ENCODERS_BY_TYPE.get(type(x))
+        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
📝 Committable suggestion

    def to_jsonable_python(x: Any) -> Any:
        encoder = ENCODERS_BY_TYPE.get(type(x))
        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
qdrant_client/http/api_client.py (2)

86-89: 🛠️ Refactor suggestion

timeout is copied but not removed from query params

After converting "timeout" to an integer you still leave "timeout" inside params, so the same value is sent twice: once as an HTTP query parameter and once as the HTTPX timeout. This can confuse the server and breaks caching. Remove it from params after extraction.

-        if "params" in kwargs and "timeout" in kwargs["params"]:
-            kwargs["timeout"] = int(kwargs["params"]["timeout"])
+        if "params" in kwargs and "timeout" in kwargs["params"]:
+            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
📝 Committable suggestion

        url = urljoin((self.host or ""), url.format(**path_params))
        if "params" in kwargs and "timeout" in kwargs["params"]:
            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
        request = self._client.build_request(method, url, **kwargs)

109-120: ⚠️ Potential issue

Robust JSON parsing when handling 429 responses

Indexing resp["status"] will raise KeyError if the field is absent. Use .get() chaining to avoid masking the original error handling with an accidental KeyError.

-                resp = response.json()
-                message = resp["status"]["error"] if resp["status"] and resp["status"]["error"] else ""
+                resp = response.json()
+                message = (
+                    resp.get("status", {}).get("error")  # graceful if key missing
+                    or ""
+                )

Replicate the same fix in the async branch (lines 193-203).

📝 Committable suggestion

        if response.status_code == 429:
            retry_after_s = response.headers.get("Retry-After", None)
            try:
                resp = response.json()
                message = (
                    resp.get("status", {}).get("error")  # graceful if key missing
                    or ""
                )
            except Exception:
                message = ""

            if retry_after_s:
                raise ResourceExhaustedResponse(message, retry_after_s)

        if response.status_code in [200, 201, 202]:
qdrant_client/http/api/aliases_api.py (1)

18-44: 💡 Verification agent

🧩 Analysis chain

Consider extracting common utility functions to a shared module

The JSON serialization utilities (to_json and jsonable_encoder) are duplicated across multiple API modules. Consider extracting these to a shared utility module to reduce duplication.


🏁 Script executed:

#!/bin/bash
# Check how many files contain duplicated JSON serialization utility functions

echo "Files with to_json function:"
rg "def to_json\(model: BaseModel" --type py qdrant_client/

echo "Files with jsonable_encoder function:"
rg "def jsonable_encoder\(" --type py qdrant_client/

Length of output: 1745


Extract common JSON serialization utilities to a shared module

The to_json and jsonable_encoder functions are duplicated in nine API modules. Centralize them into a single utility file to reduce maintenance overhead and ensure consistency.

Files containing the duplication:

  • qdrant_client/http/api/aliases_api.py
  • qdrant_client/http/api/beta_api.py
  • qdrant_client/http/api/collections_api.py
  • qdrant_client/http/api/distributed_api.py
  • qdrant_client/http/api/indexes_api.py
  • qdrant_client/http/api/points_api.py
  • qdrant_client/http/api/search_api.py
  • qdrant_client/http/api/service_api.py
  • qdrant_client/http/api/snapshots_api.py

Suggested steps:

  1. Create a new file, e.g. qdrant_client/http/api/utils.py, containing the shared functions. (Note: SetIntStr and DictIntStrAny are not part of typing, so define them locally, and derive PYDANTIC_V2 the same way the generated modules already do:)
    # qdrant_client/http/api/utils.py
    from typing import Any, Dict, Optional, Set, Union

    from pydantic import BaseModel
    from pydantic.version import VERSION as PYDANTIC_VERSION

    PYDANTIC_V2 = PYDANTIC_VERSION.startswith("2.")
    SetIntStr = Set[Union[int, str]]
    DictIntStrAny = Dict[Union[int, str], Any]

    def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
        if PYDANTIC_V2:
            return model.model_dump_json(*args, **kwargs)
        return model.json(*args, **kwargs)

    def jsonable_encoder(
        obj: Any,
        include: Optional[Union[SetIntStr, DictIntStrAny]] = None,
        exclude: Optional[Union[SetIntStr, DictIntStrAny]] = None,
        by_alias: bool = True,
        skip_defaults: Optional[bool] = None,
        exclude_unset: bool = True,
        exclude_none: bool = True,
    ) -> Any:
        if hasattr(obj, "json") or hasattr(obj, "model_dump_json"):
            return to_json(
                obj,
                include=include,
                exclude=exclude,
                by_alias=by_alias,
                exclude_unset=bool(exclude_unset or skip_defaults),
                exclude_none=exclude_none,
            )
        return obj
  2. In each API module, remove the local definitions and update imports:
    - from pydantic import BaseModel
    - from .version import PYDANTIC_V2
    -
    - def to_json(…): …
    - def jsonable_encoder(…): …
    + from .utils import to_json, jsonable_encoder

This refactor will eliminate duplication and centralize future changes.

qdrant_client/http/api/search_api.py (1)

7-8: 🛠️ Refactor suggestion

Avoid import * – switch to explicit imports

Star imports obfuscate the public surface, make static analysis harder, and are flagged by Ruff (F403/F405). Import the required symbols explicitly or keep the models as m alias only.

-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m            # single, explicit entry-point
📝 Committable suggestion

from qdrant_client.http.models import models as m  # single, explicit entry-point
🧰 Tools
🪛 Ruff (0.8.2)

7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/hybrid/formula.py (2)

109-116: ⚠️ Potential issue

exponent.is_integer() fails when exponent is int

int objects only gained is_integer() in Python 3.12, so on the older interpreters this client supports, an int exponent raises AttributeError here.

-        if base >= 0 or (base != 0 and exponent.is_integer()):
+        if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):

Alternatively, cast to float and guard with math.isfinite.

📝 Committable suggestion

        # Check for valid input
        if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):
            try:
                return math.pow(base, exponent)
            except OverflowError:
                pass

        raise_non_finite_error(f"{base}^{exponent}")

348-358: ⚠️ Potential issue

Replace assert False in tests

assert False is stripped under optimisation (python -O). Use pytest.raises or raise AssertionError directly.

-        parse_variable("$score[invalid]")
-        assert False
+        with pytest.raises(ValueError):
+            parse_variable("$score[invalid]")

Repeat for the second occurrence.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

350-350: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


356-356: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

pyproject.toml (1)

25-35: 🛠️ Refactor suggestion

Consider upper-bounding protobuf & numpy to avoid unexpected breaking majors

protobuf = ">=3.20.0" and the numpy constraints for the non-3.9 Python ranges are currently unbounded on the upper side. Both libraries regularly ship backwards-incompatible major releases (e.g., protobuf 5.*). A conservative constraint such as <5 (protobuf) and <2.2 (numpy) gives you time to test new majors before the ecosystem forces them in.

-protobuf = ">=3.20.0"
+protobuf = ">=3.20.0,<5"
📝 Committable suggestion

pydantic = ">=1.10.8,!=2.0.*,!=2.1.*,!=2.2.0"  # can't use poetry ">=1.10.8,<2.0 || >=2.2.1" since pip is complaining
grpcio = { version = ">=1.41.0", allow-prereleases = true }
protobuf = ">=3.20.0,<5"
urllib3 = ">=1.26.14,<3"
portalocker = "^2.7.0"
fastembed = [
    { version = "0.6.1", optional = true },
]
fastembed-gpu = [
    { version = "0.6.1", optional = true },
]
qdrant_client/connection.py (4)

98-103: ⚠️ Potential issue

stream-stream interceptor: remove unnecessary await

grpc.aio.StreamStreamCall isn't awaitable. Apply the same non-awaiting pattern as in the unary fixes below; a sketch follows.
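
For illustration, a sketch of the corrected stream-stream interceptor, mirroring the unary-stream fix further down (names and the True/True streaming flags follow this file's pattern by analogy; treat it as a sketch, not a verified patch):

    async def intercept_stream_stream(self, continuation, client_call_details, request_iterator):
        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, request_iterator, True, True
        )
        # Do not await continuation() here: the resulting StreamStreamCall is
        # an async iterator to hand back to the caller, not an awaitable.
        call = continuation(new_details, new_request_iterator)
        return await postprocess(call) if postprocess else call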


89-94: ⚠️ Potential issue

stream-unary interceptor suffers from identical pattern

Apply the same fix (call = continuation(...)) and adjust the return path.


70-76: ⚠️ Potential issue

Double-await bug: intercept_unary_unary breaks on real calls

response is awaited once here and converted to a plain protobuf message.
postprocess then tries to await it again, raising TypeError: object ResponseMessage is not awaitable.

-        response = await continuation(new_details, next_request)
-        return await postprocess(response) if postprocess else response
+        call = continuation(new_details, next_request)          # call is awaitable
+        if postprocess:
+            return await postprocess(call)
+        return await call
📝 Committable suggestion

        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, False
        )
        next_request = next(new_request_iterator)
        call = continuation(new_details, next_request)
        if postprocess:
            return await postprocess(call)
        return await call

80-85: ⚠️ Potential issue

Same double-await issue for unary-stream

For streaming responses the call object is not awaitable at all, so the current await continuation(...) raises immediately.

-        response_it = await continuation(new_details, next(new_request_iterator))
-        return await postprocess(response_it) if postprocess else response_it
+        call = continuation(new_details, next(new_request_iterator))
+        return await postprocess(call) if postprocess else call
📝 Committable suggestion

        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, True
        )
        call = continuation(new_details, next(new_request_iterator))
        return await postprocess(call) if postprocess else call
qdrant_client/conversions/common_types.py (1)

68-75: ⚠️ Potential issue

list[...] breaks Py < 3.9 compatibility

list[PointId] and similar PEP-585 generics only evaluate successfully starting with Python 3.9.
The header comment (lines 143-145) says we still support Python 3.7, so importing this
module on 3.7/3.8 will raise TypeError: 'type' object is not subscriptable before users
can even reach the runtime version-check in their own code.

-PointsSelector = Union[
-    list[PointId],
+from typing import List  # at top of file
+PointsSelector = Union[
+    List[PointId],

Update every other occurrence of the builtin-generic syntax (list[...], dict[...])
in this file (there are several) or drop 3.7/3.8 from the supported versions list.

Committable suggestion skipped: line range outside the PR's diff.

qdrant_client/http/api/distributed_api.py (2)

4-7: 🛠️ Refactor suggestion

Duplicate & wildcard imports clutter the namespace

Lines 4-7 import BaseModel twice and pull everything from
qdrant_client.http.models via *. Apart from the Ruff warnings, this:

  • masks the second BaseModel reference,
  • makes static analysis and IDE completion much harder,
  • risks name collisions.
-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m

(The as m alias is already used below.)
Please remove the redundant import and avoid import *.

📝 Committable suggestion

from pydantic import BaseModel
from qdrant_client.http import models as m
from pydantic.version import VERSION as PYDANTIC_VERSION
🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)


18-23: ⚠️ Potential issue

to_json forwards v1-only kwargs to Pydantic v2's model_dump_json

model_dump_json() does accept exclude_unset/exclude_none, but it rejects v1-only kwargs
such as skip_defaults or encoder; passing an unexpected kwarg raises TypeError.
Any call path that still forwards v1-style kwargs will explode under Pydantic v2.

Consider dropping the v1-only kwargs explicitly:

-def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
-    if PYDANTIC_V2:
-        return model.model_dump_json(*args, **kwargs)
+def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
+    if PYDANTIC_V2:
+        kwargs.pop("skip_defaults", None)  # v1-only, not supported by model_dump_json
+        return model.model_dump_json(*args, **kwargs)
📝 Committable suggestion

def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
    if PYDANTIC_V2:
        kwargs.pop("skip_defaults", None)  # v1-only, not supported by model_dump_json
        return model.model_dump_json(*args, **kwargs)
    else:
        return model.json(*args, **kwargs)
qdrant_client/embed/model_embedder.py (2)

32-35: ⚠️ Potential issue

ModelEmbedderWorker.start() causes TypeError

start() passes an unexpected threads keyword to ModelEmbedderWorker.__init__,
which only accepts batch_size and **kwargs.

-        return cls(threads=1, batch_size=batch_size, **kwargs)
+        # ParallelWorkerPool already sets the number of processes; no need for “threads”.
+        return cls(batch_size=batch_size, **kwargs)

Without this fix every attempt to spin up a worker process fails immediately.

📝 Committable suggestion

    @classmethod
    def start(cls, batch_size: int, **kwargs: Any) -> "ModelEmbedderWorker":
        # ParallelWorkerPool already sets the number of processes; no need for “threads”.
        return cls(batch_size=batch_size, **kwargs)

268-273: 🛠️ Refactor suggestion

Early return inside list traversal skips subsequent inference objects

If the first element of a list is not an inference object we return None
prematurely, silently discarding the rest of the list.

-        if isinstance(data, list):
-            for value in data:
-                if not isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
-                    return None
-                self._accumulate(value)
+        if isinstance(data, list):
+            for value in data:
+                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
+                    self._accumulate(value)
+            return None

This ensures all items are inspected.

📝 Committable suggestion

        if isinstance(data, list):
            for value in data:
                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
                    self._accumulate(value)
            return None
qdrant_client/http/api/service_api.py (1)

4-7: 🛠️ Refactor suggestion

Remove the duplicate BaseModel import and avoid import *.

BaseModel is imported twice (lines 4 & 5), triggering Ruff F811 and creating ambiguous symbols.
In addition, the wildcard import on line 7 hides the public surface and hurts IDE auto-completion.

-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m

(The last line lets you access generated models through the existing m alias.)
This keeps the namespace clean and eliminates the redefinition warning.

📝 Committable suggestion

from pydantic import BaseModel
from pydantic.version import VERSION as PYDANTIC_VERSION
from qdrant_client.http import models as m
🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/http/api/collections_api.py (1)

4-7: 🛠️ Refactor suggestion

Same duplicate import & wildcard issue as in service_api.py.

Please apply the import-cleanup change here as well to silence F811/F403 and keep the public namespace tidy.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/grpc/collections_pb2.py (1)

50-52: ⚠️ Potential issue

Avoid polluting the module namespace with None.

globals()['None'] = 0 creates a module attribute literally named None and risks severe confusion. It cannot be imported by name (None is a keyword, so from qdrant_client.grpc.collections_pb2 import None is itself a SyntaxError), but the stray attribute still surfaces through getattr(collections_pb2, "None") and dir(), confusing tooling and readers and leading to hard-to-trace bugs.
If the enum generator must expose a "None" value, prefer a safer alias such as Modifier_None = 0.

qdrant_client/embed/embedder.py (2)

278-286: 🛠️ Refactor suggestion

**options or {} is easy to misread

Inside a call, **options or {} parses as **(options or {}), so it does run; but the
precedence is non-obvious, and the construct trips up readers and some linters.

Refactor once and reuse locally-normalised kwargs:

-embedding_model_inst = self.get_or_init_model(model_name=model_name, **options or {})
+model_opts: dict[str, Any] = options or {}
+embedding_model_inst = self.get_or_init_model(model_name=model_name, **model_opts)

Consider extracting a small helper such as _normalize_options(options) to DRY this pattern across all _embed_* helpers.

Also applies to: 296-304, 327-338, 348-356, 365-373, 380-385


26-30: ⚠️ Potential issue

Pydantic v1 rejects config keywords in the class header

arbitrary_types_allowed=True in the class header is valid syntax (class keywords are forwarded to __init_subclass__, and Pydantic v2 supports config this way), but under Pydantic v1 the unknown keyword reaches object.__init_subclass__ and raises TypeError at import time.
Since the client supports both major versions, move the flag into a config block:

-class ModelInstance(BaseModel, Generic[T], arbitrary_types_allowed=True):  # type: ignore[call-arg]
-    model: T
-    options: dict[str, Any]
-    deprecated: bool = False
+class ModelInstance(BaseModel, Generic[T]):
+    model: T
+    options: dict[str, Any]
+    deprecated: bool = False
+
+    class Config:  # honoured by Pydantic v1; v2 still reads it (with a deprecation warning)
+        arbitrary_types_allowed = True

(Avoid defining both class Config and model_config on the same model: Pydantic v2 raises an error when both are present.)
📝 Committable suggestion

class ModelInstance(BaseModel, Generic[T]):
    model: T
    options: dict[str, Any]
    deprecated: bool = False

    class Config:  # honoured by Pydantic v1; v2 still reads it (with a deprecation warning)
        arbitrary_types_allowed = True
qdrant_client/http/api/snapshots_api.py (1)

2-7: 🛠️ Refactor suggestion

Duplicate import & star-import hide real problems

  1. BaseModel is imported twice (line 4 and 5).
  2. from qdrant_client.http.models import * makes type-checking noisy and is the reason SnapshotPriority appears “possibly undefined”.
-from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
-from pydantic import BaseModel
-from pydantic.main import BaseModel
+from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
+from pydantic import BaseModel

Replace the star import with explicit names (or keep the models as m alias only) and import SnapshotPriority directly:

-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m, SnapshotPriority

This eliminates re-definitions, improves IDE support, and resolves the Ruff warnings F403/F811/F405.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/http/api/points_api.py (1)

2-8: 🛠️ Refactor suggestion

Import hygiene: duplicated BaseModel and uncontrolled namespace

Same issue as in snapshots_api.py:

  • Redundant BaseModel import (line 4 & 5)
  • from qdrant_client.http.models import * obscures missing symbols such as WriteOrdering.

Adopt the same fix (keep a single BaseModel import and replace the star import with explicit names or the existing alias m). This will also silence Ruff F403/F811/F405.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/async_qdrant_fastembed.py (1)

398-425: 🛠️ Refactor suggestion

⚠️ Potential issue

Avoid relying on assert for production-level validation

assert statements are stripped when Python is executed with the -O (optimized) flag, which would silently bypass all the collection-compatibility checks. Converting these into explicit exceptions keeps the safeguards intact in all execution modes and allows you to surface a precise error type/message to the caller.

-assert (
-    embeddings_size == vector_params.size
-), f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+if embeddings_size != vector_params.size:
+    raise ValueError(
+        f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+    )

(…repeat for the remaining assert statements in this block…)

📝 Committable suggestion

        (embeddings_size, distance) = self._get_model_params(model_name=self.embedding_model_name)
        vector_field_name = self.get_vector_field_name()
        assert isinstance(
            collection_info.config.params.vectors, dict
        ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
        assert (
            vector_field_name in collection_info.config.params.vectors
        ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}, expected {vector_field_name}"
        vector_params = collection_info.config.params.vectors[vector_field_name]
        if embeddings_size != vector_params.size:
            raise ValueError(
                f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
            )
        assert (
            distance == vector_params.distance
        ), f"Distance mismatch: {distance} != {vector_params.distance}"
        sparse_vector_field_name = self.get_sparse_vector_field_name()
        if sparse_vector_field_name is not None:
            assert (
                sparse_vector_field_name in collection_info.config.params.sparse_vectors
            ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
            if self.sparse_embedding_model_name in IDF_EMBEDDING_MODELS:
                modifier = collection_info.config.params.sparse_vectors[
                    sparse_vector_field_name
                ].modifier
                assert (
                    modifier == models.Modifier.IDF
                ), f"{self.sparse_embedding_model_name} requires modifier IDF, current modifier is {modifier}"
qdrant_client/async_client_base.py (1)

382-386: 🛠️ Refactor suggestion

Inconsistent “sync” method in an async interface

upload_points is defined as a synchronous method inside an otherwise asynchronous base class.
All real implementations that hit the network or disk will need an async API to avoid blocking the event loop.

Consider:

-    def upload_points(
+    async def upload_points(
         self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
     ) -> None:

and mark it @abstractmethod (see next comment).

📝 Committable suggestion

    async def upload_points(
        self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
    ) -> None:
        raise NotImplementedError()
qdrant_client/local/async_qdrant_local.py (1)

916-961: ⚠️ Potential issue

Inconsistent initial scroll limit
In create_collection(), the first scroll uses limit=2 while subsequent iterations use batch_size = 100. The pagination loop still copies all points, but the first batch fetches only two instead of the intended batch size. It should read:

-            (records, next_offset) = await self.scroll(
-                from_collection_name, limit=2, with_vectors=True
-            )
+            (records, next_offset) = await self.scroll(
+                from_collection_name, limit=batch_size, with_vectors=True
+            )

Please align the initial limit with batch_size to ensure consistent data copying.

📝 Committable suggestion

        if src_collection and from_collection_name:
            batch_size = 100
            (records, next_offset) = await self.scroll(
                from_collection_name, limit=batch_size, with_vectors=True
            )
            self.upload_records(collection_name, records)
            while next_offset is not None:
                (records, next_offset) = await self.scroll(
                    from_collection_name, offset=next_offset, limit=batch_size, with_vectors=True
                )
                self.upload_records(collection_name, records)


@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/congruence_tests/test_discovery.py (1)

25-25: Improved test consistency by using sample queries.

Replacing random vector generation with predefined sample queries creates more consistent test behavior. However, the function still accepts a dims parameter that's no longer used.

Consider updating the function signature to remove the unused parameter:

-def random_vector(dims: int) -> list[float]:
+def random_vector(_: Optional[int] = None) -> list[float]:

Or better yet, rename the function to reflect its new behavior:

-def random_vector(dims: int) -> list[float]:
+def sample_vector(_: Optional[int] = None) -> list[float]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5adcf93 and 1073e87.

📒 Files selected for processing (7)
  • tests/congruence_tests/test_collections.py (1 hunks)
  • tests/congruence_tests/test_common.py (1 hunks)
  • tests/congruence_tests/test_discovery.py (1 hunks)
  • tests/congruence_tests/test_group_search.py (4 hunks)
  • tests/congruence_tests/test_query.py (3 hunks)
  • tests/congruence_tests/test_query_batch.py (3 hunks)
  • tests/congruence_tests/test_recommendation.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/congruence_tests/test_recommendation.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (17)
tests/congruence_tests/test_collections.py (1)

104-104: Vector size standardization improves test consistency.

The vector size increase from 2 to 384 aligns with the standardized vector size used across the test suite. This change ensures all test vectors have consistent dimensions, creating more realistic test scenarios.

tests/congruence_tests/test_common.py (2)

17-19: Standardized vector dimensions improve test realism.

Updating all dense vector size constants to 384 creates consistency across different vector types and makes tests more representative of real-world scenarios. This change supports the PR objective of using real vectors for testing.


22-24: Standardized sparse vector dimensions improve consistency.

Unifying sparse vector dimensions to 384 (from previously varied sizes) ensures consistency between dense and sparse vector tests. This standardization better represents real-world usage patterns.

tests/congruence_tests/test_discovery.py (1)

19-19: Added sample_queries import for using real vectors.

Good addition of the import for sample_queries from fixtures, supporting the move away from random vectors to predefined real vectors for testing.

tests/congruence_tests/test_query_batch.py (4)

22-26: Improved imports for real vector testing.

Good addition of sample_queries import along with the existing random vector generation functions, supporting the transition to using real vectors for testing.


46-50: Enhanced test realism with sample queries for text vectors.

Replacing random vector generation with predefined sample queries creates more consistent and realistic test inputs. This systematic change to how text queries and prefetch queries are constructed aligns well with the PR objective.


58-59: Standardized approach to image and code vector queries.

Good consistency in applying the same pattern of using sample queries for both image and code vector queries. This ensures all vector types benefit from the improved test realism.

Also applies to: 66-67


107-114: Improved DBSF fusion query tests with real vectors.

Using sample queries for the DBSF fusion prefetch queries completes the systematic replacement of random vectors, ensuring all test scenarios use realistic vectors.

tests/congruence_tests/test_group_search.py (5)

21-21: Good addition of the sample_queries import.

This change aligns with the PR's objective to use real vectors for testing, replacing random vectors with predefined samples.


30-33: Excellent standardization of test vectors.

Replacing random vector generation with predefined sample queries improves test determinism and consistency. This change ensures that tests use the same vector data across runs, making test results more reproducible.


222-223: Good standardization of vector sizes.

Using the text_vector_size constant instead of hardcoded values makes the code more maintainable and consistent with other tests. This change supports the standardization of vector sizes across the test suite.


233-239: Good update to query vector initialization.

Using sample_queries() instead of random vectors improves test consistency. The wrapping of the query vector in a NumPy array is maintained for the first test case, which preserves the existing test coverage for different input types.


241-243: Simplified function call by passing vector directly.

Passing the vector directly without unnecessary conversion improves code clarity. This change is consistent with the goal of standardizing how vectors are handled in tests.

tests/congruence_tests/test_query.py (4)

36-36: Good addition of the sample_queries import.

This change aligns with the PR's objective to standardize test vectors across the test suite.


50-58: Excellent standardization of query vector initialization.

Using predefined sample queries from sample_queries() instead of random vectors ensures test consistency and reproducibility. The modification to dense_vector_query_text_bis on line 56 (creating a slightly different vector by adding 42.0) is preserved, maintaining the existing test behavior while benefiting from standardized test data.


1465-1467: Good standardization of vector configuration.

Using the text_vector_size constant for vector configuration instead of hardcoded values improves consistency and maintainability. This change supports the standardization of vector sizes across the test suite.


1469-1469: Good update to fixture generation parameters.

Using the text_vector_size parameter in the generate_fixtures call ensures consistency with the vector configuration and other test files. This change further supports the standardization effort.


@coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/fixtures/points.py (1)

141-142: Consider an edge case for the skip_vectors functionality

When using preloaded vectors and the skip_vectors option is enabled, there's no special handling for the case where vectors are skipped based on random chance (as seen in lines 143-146).

Consider how the skip_vectors functionality should work with preloaded vectors. If a vector is skipped, the index-to-vector mapping might become inconsistent.

You might want to add a comment explaining this behavior or adjust the code to handle this edge case properly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1073e87 and 5319ed7.

📒 Files selected for processing (1)
  • tests/fixtures/points.py (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: Integration tests
tests/fixtures/points.py

[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (4)
tests/fixtures/points.py (4)

26-42: Verify vector dimensionality matches

The function random_vectors now uses preloaded vectors when idx is provided. However, there's no validation to ensure that the dimensionality of the preloaded vectors matches the vector_sizes parameter.

Add validation to ensure that the dimensions of the preloaded vectors match the expected dimensions:

def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
    if isinstance(vector_sizes, int):
        if idx:
+            # Verify vector dimensions match
+            if len(_text_vectors_clean[idx]) != vector_sizes:
+                raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[idx])} does not match requested dimension {vector_sizes}")
            return _text_vectors_clean[idx]
        else:
            return np.random.random(vector_sizes).round(3).tolist()

This will help avoid subtle bugs when the dimensions don't match.


46-48: Handle potential index out of range errors

The random_multivectors function now uses sample_random_multivector but there's no check to ensure that vec_count is not larger than the total number of available vectors in _text_vectors_clean.

Add validation to ensure that vec_count doesn't exceed the number of available vectors:

def random_multivectors(vector_sizes: Union[dict[str, int], int]) -> models.VectorStruct:
    if isinstance(vector_sizes, int):
        vec_count = random.randint(1, 10)
+        vec_count = min(vec_count, len(_text_vectors_clean))
        return sample_random_multivector(vector_sizes, vec_count)

This will prevent index out of range errors.


51-53: Apply the same validation for multi-vector dictionary case

Similar to the previous comment, the multi-vector dictionary case needs the same validation to prevent potential index out of range errors.

Add validation for the dictionary case as well:

for vector_name, vector_size in vector_sizes.items():
    vec_count = random.randint(1, 10)
+    vec_count = min(vec_count, len(_text_vectors_clean))
    vectors[vector_name] = sample_random_multivector(vector_size, vec_count)

This ensures consistency between the two code paths.


124-125: Add bounds check for number of points

The generate_points function samples indices without checking if num_points exceeds the number of available vectors.

Add a validation to ensure num_points doesn't exceed the available vectors:

+ if num_points > len(_text_vectors_clean):
+     raise ValueError(f"Requested {num_points} points, but only {len(_text_vectors_clean)} vectors are available")
sampled_vectors = np.random.choice(len(_text_vectors_clean), size=num_points, replace=False)

This prevents potential runtime errors when requesting more points than available vectors.

Comment on lines 13 to 15
_text_vectors = np.load("data/text.npy", mmap_mode="r")
_text_vectors_unique = np.unique(_text_vectors, axis=0)
_text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()

🛠️ Refactor suggestion

Add fallback mechanism for missing data files

Currently, if the data files are missing, the test will fail completely. Consider adding a fallback mechanism that uses randomly generated vectors when the data files are not found.

Implement a try-except block to gracefully handle missing data files:

try:
    _text_vectors = np.load("data/text.npy", mmap_mode="r")
    _text_vectors_unique = np.unique(_text_vectors, axis=0)
    _text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
    _using_real_vectors = True
except FileNotFoundError:
    print("Warning: data/text.npy not found. Using randomly generated vectors instead.")
    _text_vectors_clean = []  # Empty list as fallback
    _using_real_vectors = False

try:
    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
    _query_vectors_unique = np.unique(_query_vectors, axis=0)
    _query_vectors_clean = _query_vectors_unique.tolist()
except FileNotFoundError:
    print("Warning: data/queries.npy not found. Using randomly generated vectors instead.")
    _query_vectors_clean = []  # Empty list as fallback

Then modify the functions to check _using_real_vectors and fall back to random generation when needed:

def sample_queries(n: int) -> list[np.array]:
+    if not _using_real_vectors or len(_query_vectors_clean) == 0:
+        # Fallback to random generation
+        return [np.random.random(384).round(3).tolist() for _ in range(n)]
    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
    return [_query_vectors_clean[i] for i in sampled_vectors]

This would make the tests more robust and still allow them to run without the data files.

Also applies to: 18-24, 65-68, 124-124

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.


⚠️ Potential issue

Fix the file path reference issue

The code is attempting to load data files with hardcoded paths, but the pipeline has failed with a FileNotFoundError: No such file or directory: 'data/text.npy'. This indicates that either:

  1. The data files are missing from the repository, or
  2. The path is incorrect relative to where the tests are executed

Consider one of these solutions:

  1. Include the data files in the PR or document how to obtain them
  2. Use a relative path that accounts for the project structure:
- _text_vectors = np.load("data/text.npy", mmap_mode="r")
+ _text_vectors = np.load(os.path.join(os.path.dirname(__file__), "../../data/text.npy"), mmap_mode="r")

Don't forget to add the import os statement at the top of the file.

📝 Committable suggestion


Suggested change
-_text_vectors = np.load("data/text.npy", mmap_mode="r")
-_text_vectors_unique = np.unique(_text_vectors, axis=0)
-_text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
+import os
+
+_text_vectors = np.load(
+    os.path.join(os.path.dirname(__file__), "../../data/text.npy"),
+    mmap_mode="r",
+)
+_text_vectors_unique = np.unique(_text_vectors, axis=0)
+_text_vectors_clean = _text_vectors_unique[
+    ~np.isnan(_text_vectors_unique).any(axis=1)
+].tolist()
🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.

Comment on lines 65 to 68
def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
    return [_text_vectors_clean[i] for i in sampled_vectors]


🛠️ Refactor suggestion

Add dimension validation to sample_random_multivector

The sample_random_multivector function doesn't validate that the dimensions of the preloaded vectors match the requested vec_size.

Add dimension validation:

def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
+    # Verify vector dimensions match
+    for i in sampled_vectors:
+        if len(_text_vectors_clean[i]) != vec_size:
+            raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[i])} does not match requested dimension {vec_size}")
    return [_text_vectors_clean[i] for i in sampled_vectors]

This will help catch dimension mismatches early.

📝 Committable suggestion


Suggested change
-def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
-    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
-    return [_text_vectors_clean[i] for i in sampled_vectors]
+def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
+    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
+    # Verify vector dimensions match
+    for i in sampled_vectors:
+        if len(_text_vectors_clean[i]) != vec_size:
+            raise ValueError(
+                f"Preloaded vector dimension {len(_text_vectors_clean[i])} "
+                f"does not match requested dimension {vec_size}"
+            )
+    return [_text_vectors_clean[i] for i in sampled_vectors]

Comment on lines 18 to 24
def sample_queries(n: int) -> list[np.array]:
    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
    _query_vectors_unique = np.unique(_query_vectors, axis=0)
    _query_vectors = _query_vectors_unique.tolist()
    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
    return [_query_vectors[i] for i in sampled_vectors]


🛠️ Refactor suggestion

Improve data loading efficiency and error handling

The sample_queries function loads the data from disk on every call, which is inefficient. Additionally, it lacks error handling for the file loading operation.

Consider loading the query vectors once at module level, similar to how you handle _text_vectors:

+ _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
+ _query_vectors_unique = np.unique(_query_vectors, axis=0)
+ _query_vectors_clean = _query_vectors_unique.tolist()

def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i] for i in sampled_vectors]
+    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i] for i in sampled_vectors]

Also, add exception handling to provide better error messages if the files can't be loaded.

📝 Committable suggestion


Suggested change
-def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i] for i in sampled_vectors]
+# Module-level loading & preprocessing of query vectors
+_query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
+_query_vectors_unique = np.unique(_query_vectors, axis=0)
+_query_vectors_clean = _query_vectors_unique.tolist()
+
+def sample_queries(n: int) -> list[np.array]:
+    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i] for i in sampled_vectors]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tests/fixtures/points.py (3)

15-17: Fix the file path reference and add error handling

Your code is attempting to load data files with hardcoded paths, but this approach is fragile and may fail in different environments. The absence of error handling will cause tests to fail unexpectedly when files are missing.

Run this script to check the file access issue:

#!/bin/bash
# Check if the data files exist in the expected locations
echo "Current directory: $(pwd)"
echo "Checking for data files:"
ls -la data/ 2>/dev/null || echo "data/ directory not found"
echo "Searching for .npy files:"
find . -name "*.npy" -type f

20-27: Improve data loading efficiency in sample_queries

The function loads the data file on every call, which is inefficient. Similar to how you handle _text_vectors, consider loading the query vectors once at module level.
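An alternative to a plain module-level load is a cached loader; a sketch under the same file-layout assumptions, which avoids import-time failures while still reading the file only once:

from functools import lru_cache

import numpy as np

@lru_cache(maxsize=1)
def _load_query_vectors(path: str = "data/queries.npy") -> list[list[float]]:
    # load and deduplicate once; subsequent calls return the cached pool
    vectors = np.load(path, allow_pickle=True).astype(np.float32)
    return np.unique(vectors, axis=0).tolist()

def sample_queries(n: int) -> list[list[float]]:
    pool = _load_query_vectors()
    sampled = np.random.choice(len(pool), size=n, replace=False)
    return [list(pool[i]) for i in sampled]  # copies, so callers cannot mutate the cache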


69-73: Add dimension validation to sample_random_multivector

The function doesn't validate that the dimensions of the preloaded vectors match the requested vec_size, which could lead to dimension mismatches during testing.

🧹 Nitpick comments (4)
tests/fixtures/points.py (4)

148-148: Fix unused loop variable

The loop control variable name is not used within the loop body.

-                for name, vec in vectors.items():
+                for _name, vec in vectors.items():
🧰 Tools
🪛 Ruff (0.11.9)

148-148: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


152-152: Fix unused loop variable

The loop control variable name is not used within the loop body.

-                for name, vec in enumerate(vectors):
+                for _name, vec in enumerate(vectors):
🧰 Tools
🪛 Ruff (0.11.9)

152-152: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


30-35: Add type annotation for the idx parameter

The idx parameter lacks a type annotation, making it unclear what type is expected.

-def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
+def random_vectors(vector_sizes: Union[dict[str, int], int], idx: Optional[int] = None) -> models.VectorStruct:

Don't forget to add from typing import Optional at the top of the file.


130-132: Consider adding a docstring to describe vector sampling

The code samples vectors from preloaded data rather than generating them randomly, which is a significant change in behavior. Consider adding a docstring to explain this new approach.

 def generate_points(
     num_points: int,
     vector_sizes: Union[dict[str, int], int],
     with_payload: bool = False,
     random_ids: bool = False,
     skip_vectors: bool = False,
     sparse: bool = False,
     even_sparse: bool = True,
     multivector: bool = False,
 ) -> list[models.PointStruct]:
+    """
+    Generate points with vectors for testing.
+    
+    When not using sparse or multivector options, this function samples vectors from
+    preloaded data rather than generating them randomly, ensuring more realistic test data.
+    """
     if skip_vectors and isinstance(vector_sizes, int):
         raise ValueError("skip_vectors is not supported for single vector")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 888d94e and 5d304e8.

📒 Files selected for processing (5)
  • tests/congruence_tests/test_collections.py (2 hunks)
  • tests/congruence_tests/test_common.py (1 hunks)
  • tests/embed_tests/test_local_inference.py (1 hunks)
  • tests/fixtures/points.py (5 hunks)
  • tests/test_fastembed.py (3 hunks)
✅ Files skipped from review due to trivial changes (2)
  • tests/test_fastembed.py
  • tests/embed_tests/test_local_inference.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/congruence_tests/test_collections.py
  • tests/congruence_tests/test_common.py
🧰 Additional context used
🪛 Ruff (0.11.9)
tests/fixtures/points.py

148-148: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


152-152: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (1)
tests/fixtures/points.py (1)

147-154: Approve NaN and type assertions

Good addition of assertions to validate that vectors don't contain NaN values and have the appropriate numeric types. This helps catch data quality issues early in testing.

🧰 Tools
🪛 Ruff (0.11.9)

148-148: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


152-152: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
tests/fixtures/points.py (4)

77-81: Add dimension validation to sample_random_multivector

The sample_random_multivector function doesn't validate that the dimensions of the preloaded vectors match the requested vec_size.

Add dimension validation:

def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    doc_vectors = _text_vectors_clean.copy()
+    # Verify vector dimensions match
+    if len(doc_vectors) > 0 and len(doc_vectors[0]) != vec_size:
+        raise ValueError(f"Preloaded vector dimension {len(doc_vectors[0])} does not match requested dimension {vec_size}")
    sampled_vectors = np.random.choice(len(doc_vectors), size=vec_count, replace=False)
    return [np.array(doc_vectors[i]).astype(np.float32).tolist() for i in sampled_vectors]

16-23: ⚠️ Potential issue

Fix the file path reference issue

The code is attempting to load data files with hardcoded paths, but the pipeline has failed with a FileNotFoundError: No such file or directory: 'data/text.npy'. This indicates that either:

  1. The data files are missing from the repository, or
  2. The path is incorrect relative to where the tests are executed

Consider one of these solutions:

  1. Include the data files in the PR or document how to obtain them
  2. Use a relative path that accounts for the project structure:
- _text_vectors = np.load("data/text.npy", mmap_mode="r")[..., :text_vector_size]
+ _text_vectors = np.load(os.path.join(os.path.dirname(__file__), "../../data/text.npy"), mmap_mode="r")[..., :text_vector_size]

Don't forget to add the import os statement at the top of the file.

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 17-17: FileNotFoundError: No such file or directory: 'data/text.npy' - required data file missing causing test collection failure.


25-25: 🛠️ Refactor suggestion

Add fallback mechanism for missing data files

Currently, if the data files are missing, the test will fail completely. Consider adding a fallback mechanism that uses randomly generated vectors when the data files are not found.

Implement a try-except block to gracefully handle missing data files:

-_text_vectors_clean = generate_vectors()
+try:
+    _text_vectors_clean = generate_vectors()
+    _using_real_vectors = True
+except FileNotFoundError:
+    print("Warning: data/text.npy not found. Using randomly generated vectors instead.")
+    # Generate fallback random vectors
+    _text_vectors_clean = [np.random.random(text_vector_size).tolist() for _ in range(100)]
+    _using_real_vectors = False

28-36: 🛠️ Refactor suggestion

Improve data loading efficiency and error handling

The sample_queries function loads the data from disk on every call, which is inefficient. Additionally, it lacks error handling for the file loading operation.

Consider loading the query vectors once at module level, similar to how you handle _text_vectors_clean:

+try:
+    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)[
+        ..., :text_vector_size
+    ]
+    _query_vectors_unique = np.unique(_query_vectors, axis=0)
+    _query_vectors_clean = _query_vectors_unique.tolist()
+except FileNotFoundError:
+    print("Warning: data/queries.npy not found. Using randomly generated vectors instead.")
+    _query_vectors_clean = [np.random.random(text_vector_size).tolist() for _ in range(100)]

def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)[
-        ..., :text_vector_size
-    ]
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i].copy() for i in sampled_vectors]
+    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i].copy() for i in sampled_vectors]
🧹 Nitpick comments (2)
tests/fixtures/points.py (2)

156-159: Fix unused loop variable

The loop control variable name is not used within the loop body.

-                for name, vec in vectors.items():
+                for _name, vec in vectors.items():
                     assert np.array(vec).dtype.kind in ("f", "i")  # float or int
                     assert not np.isnan(np.array(vec)).any()
🧰 Tools
🪛 Ruff (0.11.9)

156-156: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


160-163: Fix unused loop variable

The loop control variable name is not used within the loop body.

-                for name, vec in enumerate(vectors):
+                for _name, vec in enumerate(vectors):
                     assert np.array(vec).dtype.kind in ("f", "i")  # float or int
                     assert not np.isnan(np.array(vec)).any()
🧰 Tools
🪛 Ruff (0.11.9)

160-160: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d304e8 and 7c577db.

📒 Files selected for processing (2)
  • tests/congruence_tests/test_search.py (1 hunks)
  • tests/fixtures/points.py (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tests/congruence_tests/test_search.py (1)
tests/fixtures/points.py (1)
  • generate_vectors (16-22)
tests/fixtures/points.py (1)
qdrant_client/http/models/models.py (1)
  • SparseVector (2706-2712)
🪛 Ruff (0.11.9)
tests/fixtures/points.py

156-156: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


160-160: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

🪛 GitHub Actions: Integration tests
tests/fixtures/points.py

[error] 17-17: FileNotFoundError: No such file or directory: 'data/text.npy' - required data file missing causing test collection failure.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (3)
tests/fixtures/points.py (2)

13-13: Good addition of a centralized vector size constant.

This makes the vector size consistent across tests, improving maintainability.


38-51: LGTM! Good implementation of optional index parameter.

The modification allows for reusing preloaded vectors when an index is provided, which improves consistency across tests.

tests/congruence_tests/test_search.py (1)

19-19: LGTM! Good import addition.

The import of generate_vectors from tests.fixtures.points aligns with the goal of standardizing vector generation across tests.

Comment on lines +26 to +31
_text_vectors = generate_vectors()
sampled_vectors = np.random.choice(len(_text_vectors), size=3, replace=False)

self.query_text = _text_vectors[sampled_vectors[0]]
self.query_image = _text_vectors[sampled_vectors[1]]
self.query_code = _text_vectors[sampled_vectors[2]]

💡 Verification agent

🧩 Analysis chain

Improve test reliability by handling potential missing data files

The test now depends on external data files that may not be available in every environment. Consider adding error handling for when generate_vectors() fails, and confirm that this dependency is compatible with the project's test strategy.


🏁 Script executed:

#!/bin/bash
# Check if the test is trying to access the data files
grep -r "data/text.npy" --include="*.py" .
grep -r "data/queries.npy" --include="*.py" .

# Check if these files exist in the repository
fd "text.npy" .
fd "queries.npy" .

Length of output: 349


Handle missing data files in tests/fixtures/points.py

The helper in tests/fixtures/points.py loads

np.load("data/text.npy", …)
np.load("data/queries.npy", …)

but neither file is bundled in the repo. In environments without those .npy files, generate_vectors() will raise FileNotFoundError and break the tests. To improve reliability, please choose one of the following:

  • Wrap each np.load in a try/except that catches FileNotFoundError and calls pytest.skip(…) with a clear message.
  • Add sample .npy files under tests/fixtures/data/ (and update the paths accordingly).
  • Mock or patch np.load in your test suite so that generate_vectors() doesn’t depend on external files.
🤖 Prompt for AI Agents
In tests/congruence_tests/test_search.py around lines 26 to 31, the call to
generate_vectors() depends on external .npy data files that may be missing in
some environments, causing the test to fail. To fix this, add error handling
around generate_vectors() to catch FileNotFoundError and call pytest.skip() with
a clear message explaining the missing data files. This will gracefully skip the
test when the required files are not present, improving test reliability.
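A minimal sketch of the skip-based option (assuming the loader stays in tests/fixtures/points.py; the function name is illustrative):

import numpy as np
import pytest

def load_vectors(path: str) -> list[list[float]]:
    try:
        return np.load(path, mmap_mode="r").tolist()
    except FileNotFoundError:
        # skips the whole module instead of crashing test collection
        pytest.skip(f"required data file {path} is missing", allow_module_level=True)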

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/fixtures/points.py (2)

16-23: ⚠️ Potential issue

Hard-coded data path still crashes the test run – add robust fallback logic

generate_vectors() will raise FileNotFoundError at import time whenever data/queries_clean.npy is absent (see the current CI failure).
This exact problem was flagged in the previous review but remains unresolved.

-def generate_vectors():
-    _text_vectors = np.load("data/queries_clean.npy", mmap_mode="r")[..., :text_vector_size]
-    ...
-    return _text_vectors_clean
+def generate_vectors() -> list[list[float]]:
+    """Load & sanitise vectors, falling back to an empty list when the file is missing."""
+    data_path = os.path.join(os.path.dirname(__file__), "../../data/queries_clean.npy")
+    try:
+        _text_vectors = np.load(data_path, mmap_mode="r")[..., :text_vector_size]
+    except FileNotFoundError:
+        print(f"⚠  {data_path} not found – falling back to random vectors.")
+        return []
+
+    _text_vectors_unique = np.unique(_text_vectors, axis=0)
+    _text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)]
+    return _text_vectors_clean.tolist()

Don’t forget to import os at the top of the file.

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 17-17: FileNotFoundError: No such file or directory: 'data/queries_clean.npy' - required numpy data file missing causing test collection failure.


28-35: 🛠️ Refactor suggestion

⚠️ Potential issue

Repeated disk-loads, path issues and missing-file crashes in sample_queries

  1. The file names look swapped: sample_queries loads text_clean.npy while generate_vectors above loads queries_clean.npy; please double-check which file is intended.
  2. The file is loaded every time sample_queries is called, which is wasteful.
  3. No error handling means the entire test suite still explodes when the file is missing.

Proposed refactor (loads once, shares the same robust fallback pattern):

- def sample_queries(n: int) -> list[np.array]:
-     _query_vectors = np.load("data/text_clean.npy", allow_pickle=True).astype(np.float32)[..., :text_vector_size]
-     _query_vectors_unique = np.unique(_query_vectors, axis=0)
-     _query_vectors = _query_vectors_unique.tolist()
-     sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-     return [_query_vectors[i].copy() for i in sampled_vectors]
+
+try:
+    _query_vectors_clean: list[list[float]] = (
+        np.load(os.path.join(os.path.dirname(__file__), "../../data/text_clean.npy"), allow_pickle=True)
+        .astype(np.float32)[..., :text_vector_size]
+        .tolist()
+    )
+except FileNotFoundError:
+    _query_vectors_clean = []
+
+def sample_queries(n: int) -> list[np.array]:
+    if not _query_vectors_clean:
+        return [np.random.random(text_vector_size).astype(np.float32).tolist() for _ in range(n)]
+
+    sampled = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i].copy() for i in sampled]

This removes the hot-path I/O and keeps tests runnable without external data.

🧹 Nitpick comments (2)
tests/fixtures/points.py (2)

70-74: Dead code: generate_random_multivector is no longer referenced

The helper is never called after migrating to sample_random_multivector. Consider deleting it to keep the fixture lean.


158-165: Rename unused loop variables to appease Ruff and improve clarity

Both loops iterate only for the side-effects of assertions; the key/index is unused.

-                for name, vec in vectors.items():
+                for _unused, vec in vectors.items():
...
-                for name, vec in enumerate(vectors):
+                for _idx, vec in enumerate(vectors):
🧰 Tools
🪛 Ruff (0.11.9)

159-159: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


163-163: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c577db and 71ad426.

📒 Files selected for processing (2)
  • tests/congruence_tests/test_multivector_recommend_queries.py (4 hunks)
  • tests/fixtures/points.py (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/congruence_tests/test_multivector_recommend_queries.py (1)
qdrant_client/http/models/models.py (1)
  • RecommendStrategy (2056-2066)
🪛 Ruff (0.11.9)
tests/fixtures/points.py

159-159: Loop control variable name not used within loop body

Rename unused name to _name

(B007)


163-163: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

🪛 GitHub Actions: Integration tests
tests/fixtures/points.py

[error] 17-17: FileNotFoundError: No such file or directory: 'data/queries_clean.npy' - required numpy data file missing causing test collection failure.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (5)
tests/congruence_tests/test_multivector_recommend_queries.py (5)

157-157: Expanded negative examples list for more comprehensive testing

This change adds an additional negative example ID (14) to the only_negatives_best_score_recommend_euclid method, which aligns with the PR objective of enhancing tests with more realistic vector configurations. Using more example points provides a more comprehensive test for the best score recommendation strategy.


171-171: Enhanced positive examples for sum_scores recommendation testing

The addition of a third positive example (30) to the sum_scores_recommend method provides a more robust test case for the sum scores recommendation strategy. This change is consistent with the PR's goal of using real vectors for testing and helps ensure the recommendation algorithm works correctly with multiple input vectors.


205-207: Expanded negative examples list for sum_scores recommendation

This change adds a third negative example ID (14) to the only_negatives_sum_scores_recommend method, consistent with the other modifications in this PR. Using three negative examples instead of two provides a more thorough test of the sum_scores recommendation strategy with negative-only inputs.


224-224: Added third negative example to sum_scores recommendation with Euclidean distance

The addition of a third negative example ID (14) to the only_negatives_sum_scores_recommend_euclid method maintains consistency with the other test methods and provides more comprehensive testing for the sum_scores strategy using Euclidean distance.
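For context, a sum_scores recommendation with the expanded examples looks roughly like this; client, COLLECTION_NAME, and the concrete ids are illustrative assumptions, following the pattern described above:

from qdrant_client import models

response = client.query_points(  # `client` and COLLECTION_NAME are assumed test context
    collection_name=COLLECTION_NAME,
    query=models.RecommendQuery(
        recommend=models.RecommendInput(
            positive=[10, 19, 30],   # ids illustrative: two originals plus the new example
            negative=[12, 14],
            strategy=models.RecommendStrategy.SUM_SCORES,
        )
    ),
)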


271-319:

✅ Verification successful

Verify test consistency with expanded vector examples

The main test function test_simple_recommend() has remained unchanged, which is good as it ensures the testing workflow remains consistent. However, it would be beneficial to verify that the expanded number of example vectors (adding ID 14 to negative examples and ID 30 to positive examples) doesn't unexpectedly alter test outcomes.


🏁 Script executed:

#!/bin/bash
# Check if the added vector IDs (14, 30) are valid and exist in the test fixtures

# Search for references to the generation of test fixtures
rg -A 5 "generate_multivector_fixtures" tests/

Length of output: 15913


🏁 Script executed:

#!/bin/bash
# Display the full implementation of generate_multivector_fixtures for ID assignment logic
rg -n "def generate_multivector_fixtures" -A200 tests/congruence_tests/test_common.py

Length of output: 9301


No action needed: fixture ID expansion won’t affect consistency checks

We’ve verified that generate_multivector_fixtures simply delegates to generate_points with sequential IDs (default random_ids=False), so adding more vectors only extends the ID range without altering existing points. All assertions in test_simple_recommend() compare local vs. remote client behavior (not specific IDs), so expanding the fixture set will not change test outcomes.

Comment on lines +38 to +50
def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
    if isinstance(vector_sizes, int):
-        return np.random.random(vector_sizes).round(3).tolist()
+        if idx:
+            return _text_vectors_clean[idx].copy()
+        else:
+            return np.random.random(vector_sizes).tolist()  # .round(3)
    elif isinstance(vector_sizes, dict):
        vectors = {}
        for vector_name, vector_size in vector_sizes.items():
-            vectors[vector_name] = np.random.random(vector_size).round(3).tolist()
+            if idx:
+                vectors[vector_name] = _text_vectors_clean[idx].copy()
+            else:
+                vectors[vector_name] = np.random.random(vector_size).tolist()  # .round(3)

⚠️ Potential issue

idx falsy check causes silent mis-sampling when idx == 0

if idx: treats 0 as “not supplied”, so the very first sampled index falls back to a random vector instead of the expected pre-loaded one.

-        if idx:
+        if idx is not None:
             return _text_vectors_clean[idx].copy()

Apply the same fix inside the dictionary branch (lines 47-48).

🤖 Prompt for AI Agents
In tests/fixtures/points.py around lines 38 to 50, the condition checking if idx
is supplied uses a falsy check (if idx:), which incorrectly treats idx == 0 as
not supplied, causing silent mis-sampling. Change the condition to explicitly
check for None (if idx is not None:) in both the integer and dictionary branches
to correctly handle idx == 0 and ensure the pre-loaded vector is returned when
idx is zero.
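A two-line illustration of the falsy-zero trap:

idx = 0
if idx:                # False: 0 is falsy, so the preloaded branch is silently skipped
    print("preloaded vector")
if idx is not None:    # True: the explicit None check handles index 0 correctly
    print("preloaded vector")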
